Contextual Sparsity with Correction for Efficient LLMs

Yang Zhou

Neural Information Processing Systems

With the blossoming of large language models (LLMs), inference efficiency has become increasingly important. Various approximate methods have been proposed to reduce the cost at inference time. Contextual Sparsity (CS) is appealing for its training-free nature and its ability to reach a higher compression ratio seemingly without significant performance degradation. However, after a comprehensive evaluation of contextual sparsity methods on various complex generation tasks, we find that although CS succeeds in prompt-understanding tasks, it significantly degrades model performance on reasoning, deduction, and knowledge-based tasks.
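To make the idea of contextual sparsity concrete, below is a minimal, illustrative sketch of sparsifying a feed-forward (MLP) block by keeping only the neurons most active for the current input. The layer sizes, the 25% keep ratio, and the use of the up-projection pre-activations as the selection signal are assumptions for illustration only, not the specific method evaluated in the paper; real CS systems (e.g., Deja Vu-style approaches) use a lightweight predictor so they do not compute the full up-projection first.

```python
# Illustrative sketch of contextual sparsity in an MLP block (assumptions:
# ReLU MLP, top-k selection on |pre-activation|, 25% of neurons kept).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, keep_ratio = 64, 256, 0.25      # keep 25% of MLP neurons

W_up = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
W_down = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_ff)

def dense_mlp(x):
    """Full (dense) feed-forward pass: relu(x @ W_up) @ W_down."""
    return np.maximum(x @ W_up, 0.0) @ W_down

def contextually_sparse_mlp(x):
    """Compute only the neurons predicted to be most active for this token.

    For clarity this sketch computes the full pre-activations and then
    selects top-k; a real system would predict the active set cheaply
    instead, which is where the inference savings come from.
    """
    scores = x @ W_up                               # per-token pre-activations
    k = int(keep_ratio * d_ff)
    idx = np.argpartition(-np.abs(scores), k)[:k]   # indices of top-k neurons
    hidden = np.maximum(scores[idx], 0.0)           # activate only kept neurons
    return hidden @ W_down[idx]                     # down-project the subset

x = rng.standard_normal(d_model)
dense, sparse = dense_mlp(x), contextually_sparse_mlp(x)
print("relative error:", np.linalg.norm(dense - sparse) / np.linalg.norm(dense))
```

The point of the sketch is the selection rule: which neurons are kept changes with every input (hence "contextual"), and no retraining is required, which is why the abstract describes CS as training-free while still allowing accuracy loss on tasks that depend on the skipped computation.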